EXPERIMENTAL PHILOSOPHY LEADING TO A SMALL SCALE
DIGITAL DATA BASE OF THE CONTERMINOUS UNITED STATES FOR 
DESIGNING EXPERIMENTS WITH REMOTELY SENSED DATA 

ABSTRACT 

Research using satellite remotely sensed data, even within any single scientific discipline, has
often lacked a unifying principle or strategy with which to plan or integrate studies conducted over
an area so large that exhaustive examination is infeasible, e.g., the U.S.A. However, such a series of
studies would seem to be at the heart of what makes satellite remote sensing unique, that is, the
ability to select for study from among remotely sensed data sets distributed widely over the U.S.,
over time, where the resources do not exist to examine all of them. What we do lack is the
previously noted strategy to aid in the development of formal testable hypotheses and the selection of
study locations so as to minimize the number of samples subject to the ability to construct desired
inferences.

Using this philosophical underpinning and the concept of a unifying principle, we have
constructed an operational procedure for developing a sampling strategy and formal testable hypotheses.
We believe the procedure to be applicable across disciplines, when the investigator restates the
research question in symbolic form, i.e., quantifies it.

The procedure is set within the statistical framework of general linear models. The dependent 
variable is any arbitrary function of remotely sensed data and the independent variables are values 
or levels of factors which represent regional climatic conditions and/or properties of the earth’s sur- 
face. These factors are operationally defined as maps from the U.S. National Atlas (U.S.G.S., 1970). 
Eighty-five maps from the National Atlas, representing climatic and surface attributes, were
automated by point counting at an effective resolution of one observation every 17.6 km (11 miles),
yielding 22,505 observations per map. The maps were registered to one another in a two step
procedure producing a coarse, then fine scale registration. After registration, the maps were iteratively
checked for errors using manual and automated procedures. The “error-free” maps were annotated
with identification and legend information and then stored as card images, one map to a file.

A sampling design will be accomplished through a regionalization analysis of the National Atlas 
data base (presently being conducted). From this analysis a map of “homogeneous regions” of the 
U.S.A. will be created and samples (Landsat scenes) assigned by region. 

While designed for use with remote sensing experiments, the data base, the method of analyzing 
it and the philosophy behind it are general enough to serve as a framework for other studies being 
conducted over large portions of the U.S.A. 


MOTIVATION AND OBJECTIVE 

To date satellite remote sensing has been characterized by repetitive coverage of large areas 
resulting in large volumes of data. As such this data set is unique and still unexamined for its 
potential utility in the detection of patterns over large regions, over time. Conceptually, the types
of patterns we wish to detect in the remotely sensed data are those we can relate to the variation in 
regional climatic conditions, and/or properties of the earth’s surface. These patterns are often inti- 
mately associated with the solutions to research problems in several disciplines. For example, 
classifiers and decision rules are applied to remotely sensed data in a variety of disciplines. An 
important question to address then becomes, does the goodness of fit of a particular classifier vary 
as a function of location and time of year? Other examples of patterns of interest could be impor- 
tant inputs into setting parameters of a remote sensing system or the strategy for its use. These 
would include the amount of data compression achievable over a given location, or, with pointable 
systems, the number of times that an area should be imaged over a given time period. 

There would be little dispute that the answer is yes to the question, do such variables vary with 
location and time of year? However, there is also little information in such an answer. Therefore in 
order to examine the variables for such patterns, the hypotheses are equivalenced to a series of
analyses involving, for example, changes in frequency distributions of digital numbers with location
(frequency distributions are fundamental attributes of classifiers), or measures of relatedness of
neighboring pixels such as the autocorrelation function (related to data compression), spatial and
textural measures. To complete the search for patterns in these functions of the remote sensing data,
the functions are modeled as functions of suitably operationalized ancillary factors, i.e. attributes 
over space and time of the ground location. 

In such an analysis scheme the patterns of interest are computationally intensive to derive, and the
problem of detecting patterns is further complicated by the vast quantities and the cost of data.
Therefore, any analysis scheme must be automated (hence quantitative) and use only a portion or sample
of the available data.

While a great deal of research with satellite remotely sensed data has been performed in a
number of disciplines, little thought has been given a priori on how to integrate studies at different
locations. Even less thought has been given to selecting a series of study sites, a priori, based upon some
unifying conception. The approach of combining “grab” samples to infer any pattern might be 
called a “bottom up” design and is a very inefficient, potentially biased sampling design. The alter- 
native is what might be called a “top-down” experimental design. The objective of this paper is to 
present a data base and philosophy which we are using in just such a “top down” approach as the 
rationale for selecting samples (Landsat scenes) and for generating testable hypotheses about varia- 
tion in remotely sensed data. 

THE ANALYSIS SETTING 

Methods for selecting representative subsets of individuals and discerning patterns of variation 
are formalized within the realm of statistics and we shall motivate the data base within such a frame- 
work. More specifically, the target population is the conterminous United States. The individual 
elements of the population are the 80 m pixels described by Landsats 1, 2 and 3. The variables of 
interest are the reflected energies occurring in the four spectral bands, recorded by the satellite at 
each pixel. We might then set about collecting a simple random sample of pixels. However, such a 
sampling scheme is neither practical nor efficient for discerning patterns. An alternative is to group 
pixels first and then sample within the group. Reasons are as follows: 

1. The geographic or spatial position of a pixel is a fundamental property of the data set. 

2. Some of the derived variables are functions of groups of pixels. 

3. A pixel is not precisely the same location for every pass of the satellite. 

4. As Landsat data for any given pixel is only available by acquiring a nominal scene, it would
be reasonable to use the path-row system of nominal scenes as a grouping factor.

Clearly we are describing a multistage sampling strategy with important stages being initially the 
selection of representative sets of scenes and secondly, pixels within a scene. Further, we have re- 
gressed a portion of the sampling question to another scale - the selection of scenes. A strategy for 
this selection is developed below. 

The adoption of the path-row scheme may also be viewed as a stratification of the population. 
In general, the rationale for imposing a stratification upon a population is the researcher's belief
that, when the elements of the population are grouped according to a set of attributes other
than the random variable(s) under consideration, the within group variation of the random variables
is less than that for any combination of two or more of the groups.

Therefore, by judicious selection of grouping factors, the researcher can formulate and test
hypotheses about the population. Indeed, in this experimental environment, a stratification is 
equivalent to a hypothesis. Note that the discovery of a stratification which satisfactorily reduces 
the ratio of the within group variance to the total variance is precisely what we mean by the discern- 
ment of patterns. 
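
As an illustration of the criterion just described, the following minimal sketch computes the ratio of within group variation to total variation for a candidate stratification. It works with sums of squares rather than variances, which is an assumption of this sketch; the function name and the sample values are hypothetical and do not come from the data base.

```python
import numpy as np

def within_to_total_ratio(values, strata):
    """Ratio of pooled within-stratum sum of squares to total sum of squares.

    Values near 0 mean the stratification accounts for most of the variation;
    values near 1 mean it accounts for almost none. Assumes the values are
    not all identical (otherwise the total sum of squares is zero).
    """
    values = np.asarray(values, dtype=float)
    strata = np.asarray(strata)
    total_ss = np.sum((values - values.mean()) ** 2)
    within_ss = 0.0
    for label in np.unique(strata):
        group = values[strata == label]
        within_ss += np.sum((group - group.mean()) ** 2)
    return within_ss / total_ss

# Hypothetical observations grouped two different ways.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
good = within_to_total_ratio(values, ["a", "a", "a", "b", "b", "b"])
poor = within_to_total_ratio(values, ["a", "b", "a", "b", "a", "b"])
```

Here the first grouping separates low from high values and so gives a much smaller ratio than the second, which mixes them.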

Such an experiment can be effected operationally by modeling the value of the random vari- 
ables as a general linear model. Under this approach the strata are known as treatments which are 
combinations of specific values of the grouping factors. Once the treatments are established, the 
random variables are expressed as linear functions of the effects of grouping factors and the inter- 
actions between these factors. For example: 

$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk}$$

is a general linear model,

where:

$Y_{ijk}$ is the random variable measured on the kth element of the i, j treatment (stratum);

$\mu$ is an overall mean;

$\alpha_i$ is the effect or contribution to the random variable from the ith level of factor A, which
occurs at I levels (i = 1, ..., I);

$\beta_j$ is the contribution to the random variable from the jth level of factor B, which occurs at
J levels (j = 1, ..., J);

$(\alpha\beta)_{ij}$ is an interaction arising from factor A occurring at the ith level and factor B
occurring at the jth level;

$e_{ijk}$ is an error term.

In this model, every combination of a specific i and j represents a treatment and from our previous 
argument defines a particular subpopulation or stratum. Analysis of variance (ANOVA) theory
(Scheffé, 1959) provides the method for partitioning variation in the random variables among
factors, together with test statistics which can be used to determine which factors explain significant
amounts of variation.
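
To make the terms of the two-factor model concrete, the sketch below estimates the overall mean, the two main effects, the interaction and the residuals from cell means. It assumes a balanced layout (the same number of observations in every treatment), which the text does not require; the function name and the sample data are hypothetical.

```python
import numpy as np

def two_factor_effects(y):
    """Estimate the terms of a balanced two-factor model from cell means.

    y has shape (I, J, K): I levels of factor A, J levels of factor B and
    K observations in every treatment (balance is assumed).
    """
    y = np.asarray(y, dtype=float)
    mu = y.mean()                               # overall mean
    alpha = y.mean(axis=(1, 2)) - mu            # effect of each level of A
    beta = y.mean(axis=(0, 2)) - mu             # effect of each level of B
    cell = y.mean(axis=2)                       # treatment (stratum) means
    interaction = cell - mu - alpha[:, None] - beta[None, :]
    residual = y - cell[:, :, None]             # estimated error terms
    return mu, alpha, beta, interaction, residual

# Hypothetical additive data: 2 levels of A, 3 levels of B, 2 observations per cell.
a = np.array([-1.0, 1.0])
b = np.array([-2.0, 0.0, 2.0])
base = 10.0 + a[:, None] + b[None, :]
y = np.stack([base + 0.5, base - 0.5], axis=2)
mu, alpha, beta, interaction, residual = two_factor_effects(y)
```

Because the sample data were built without an interaction, the estimated interaction terms are all zero and the residuals carry only the within-cell scatter.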

In addition to representing a set of hypotheses to be tested, a particular stratification can be 
used to determine the total number and allocation of samples. For example, the total number of 
samples should be proportional to the total number of strata with the number of samples allocated 
to a particular stratum equal to the proportion of the total population represented by the stratum 
times the total number of samples. 
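
The allocation rule above can be sketched as follows. The rule itself comes from the text; the handling of fractional samples (largest remainder rounding), the function name and the stratum sizes are assumptions made only for illustration.

```python
def proportional_allocation(stratum_sizes, total_samples):
    """Allocate samples to strata in proportion to stratum size.

    stratum_sizes maps a stratum name to the number of population elements
    (for example, grid points) it contains. Fractional shares are resolved
    by largest-remainder rounding, which is an assumption of this sketch.
    """
    population = sum(stratum_sizes.values())
    exact = {name: total_samples * size / population
             for name, size in stratum_sizes.items()}
    allocation = {name: int(share) for name, share in exact.items()}
    leftover = total_samples - sum(allocation.values())
    # Give the remaining samples to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation

# Hypothetical strata whose sizes sum to the 22,505 grid points of the study area.
sizes = {"region_1": 12000, "region_2": 7505, "region_3": 3000}
allocation = proportional_allocation(sizes, 30)
```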

OPERATIONALIZING THE APPROACH: 

The Factors 

Clearly, the problem of setting up a stratification in order to formulate hypotheses or select 
Landsat scenes (samples) is now recast as the selection of factors and the description of the dis- 
tribution of their values (levels) over the population. In setting up a stratification for use with 
remote sensing, reasonable factors to use include climatic variables, properties of the land surface
and their interactions. These factors were operationally defined as the maps of the National Atlas 
produced by the U.S. Geological Survey (USGS, 1970). We selected 85 maps from the National 
Atlas (see Table 1) and transformed them into data sets we could manipulate. This quantification 
was achieved by placing transparent grids on each map and recording a code for the value of the 
map occurring at each row-column intersection, a procedure commonly called point counting. All
maps were on an Albers equal area projection base (Maling, 1973), so square grids were used to
point count all the maps. 

The maps which were point counted possessed three scales. Maps having a scale of 1 to
7,500,000 or 1 to 17,000,000 were point counted so that an observation was taken every 17.6 km
(11 miles) across the U.S.A. A third set of maps displaying monthly climatic variables was mapped
at a scale of 1 to 34,000,000. For these maps grids were used so that an observation was recorded
every 35.2 km (22 mi.) across the U.S.A. Since these climatic maps possess broad contours, we were
fairly confident in doubling the maps in both the east-west and north-south directions to achieve the
17.6 km data collection increment. Some of the contour lines on the “quadrupled” maps are
saw-toothed, but considering the degree of smoothing, the standards for map accuracy as well as the
reproduction methods, one might argue that these generated contour lines are clearly within the
level of uncertainty of the maps.
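
The text does not state how the doubling was carried out. One plausible reading, consistent with the saw-toothed contours, is simple replication of each coarse observation into a 2 by 2 block. The sketch below shows that reading only; the function name and the sample grid are hypothetical.

```python
import numpy as np

def double_grid(coarse):
    """Replicate every cell of a coarse grid into a 2 x 2 block.

    This turns a grid sampled every 35.2 km into one sampled every 17.6 km
    without interpolation, so contour edges become stair-stepped.
    """
    coarse = np.asarray(coarse)
    return np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)

# Hypothetical 2 x 2 coarse grid of map codes.
coarse = np.array([[1, 2],
                   [3, 4]])
fine = double_grid(coarse)
```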

Each point counted map consists of a rectangular array of 154 rows by 258 columns, yielding
39,732 points. Of these points, 17,227 are coded as outside the study area (for example, the 1st
and 154th rows are completely outside the study area). These points lie either in open ocean,
Canada or Mexico. Of the remaining 22,505 points, 1,185 points are coded as inland water.
Included in the inland water category are not only interior lakes, but also bays and bodies of water on
the landward side of islands, water between barrier islands and water between peninsulas and the
coast.

Table 1

Maps in National Atlas Data Base

Maps listed as having a resolution of 35.2 km were doubled as described in the text, so that the final effective resolution for all maps is 17.6 km.



The maps were registered to one another in two steps. A coarse registration was accomplished
by use of reference points on the base maps to position the transparent point counting grids. In the
second registration step each map was registered to a truth mask. The truth mask was a separately
point counted map containing only three map symbols:

1. outside the study area,

2. study area excluding inland water, and

3. inland water.

The application of this truth mask to each map registered the boundaries of the study area and the
inland water bodies.
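
The second registration step might be sketched as below. This is a guess at the mechanics, not a description of the procedure actually used: the numeric mask codes, the `water_code` parameter and the function name are all hypothetical. Only the use of 0 for “outside study area” in each map is taken from the text.

```python
import numpy as np

# Hypothetical numeric codes for the three truth mask symbols.
OUTSIDE, LAND, WATER = 0, 1, 2

def apply_truth_mask(map_codes, mask, water_code):
    """Force a point counted map to agree with the truth mask.

    Cells the mask marks as outside the study area are set to 0, and cells
    it marks as inland water are set to water_code (a hypothetical
    parameter). Land cells where the original map holds no land value are
    reported so they can be checked by hand.
    """
    map_codes = np.asarray(map_codes)
    mask = np.asarray(mask)
    registered = map_codes.copy()
    registered[mask == OUTSIDE] = 0
    registered[mask == WATER] = water_code
    unresolved = (mask == LAND) & ((map_codes == 0) | (map_codes == water_code))
    return registered, unresolved

# Hypothetical 2 x 3 example.
mask = np.array([[0, 1, 2],
                 [1, 1, 0]])
raw = np.array([[5, 5, 5],
                [0, 7, 7]])
registered, unresolved = apply_truth_mask(raw, mask, water_code=99)
```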

Errors within the study area were detected using a combination of automated and manual
checks. Firstly, an automated procedure was used to flag every singleton cell and absolute
differences in the numeric codes of neighboring cells greater than a selected tolerance. Next, either a
shade print or an image produced on the Interactive Digital Image Manipulation System (IDIMS)
was compared to the original map. Errors were noted, corrected and the resulting map was reviewed
again.
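
A minimal sketch of the automated check follows. The text does not define a singleton cell or the neighborhood used; here a singleton is assumed to be a cell whose code differs from every one of its four edge-adjacent neighbors, and the function name and sample grid are hypothetical. The tolerance test is meaningful only for maps whose numeric codes are ordered.

```python
import numpy as np

def flag_suspect_cells(grid, tolerance):
    """Return two boolean arrays: singleton cells and large-jump cells.

    A singleton differs from all of its existing 4-neighbours (an assumed
    definition). A large-jump cell has at least one 4-neighbour whose code
    differs from its own by more than tolerance.
    """
    grid = np.asarray(grid, dtype=int)
    rows, cols = grid.shape
    singleton = np.ones((rows, cols), dtype=bool)
    large_jump = np.zeros((rows, cols), dtype=bool)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        # Slices pairing each cell with its neighbour in direction (dr, dc).
        here = (slice(max(0, -dr), rows - max(0, dr)),
                slice(max(0, -dc), cols - max(0, dc)))
        there = (slice(max(0, dr), rows - max(0, -dr)),
                 slice(max(0, dc), cols - max(0, -dc)))
        diff = np.abs(grid[here] - grid[there])
        singleton[here] &= diff > 0
        large_jump[here] |= diff > tolerance
    return singleton, large_jump

# Hypothetical 3 x 3 map with one isolated high value and one isolated corner.
grid = np.array([[3, 3, 3],
                 [3, 9, 3],
                 [3, 3, 4]])
singleton, large_jump = flag_suspect_cells(grid, tolerance=2)
```

Flagged cells are candidates for review only; as the text notes, the final decision was made by comparison with the original map.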

PHYSICAL ARRANGEMENT OF MAP DATA 

The maps are available on magnetic tape, each map composing a file. The contents of a file are
given in Table 2. The first record of each file gives the map number as per Table 2 and the page of
the National Atlas on which the map can be found. For maps covering more than one page, the
page number is the first page. The next record has the words MAP TITLE in the first nine columns
followed by the title in columns 12 through 80. The next set of records contains the map legend,
one record per map symbol. The symbol numeric code is an integer value right justified in columns
1 to 4. This value is followed in columns 6 through 80 by the definition of the symbol. The end of
every legend is denoted by the numeric code 0, which is the code for “outside study area” in each
map. In the legend the map symbols are arranged in order of increasing numeric values until the
0 code. Thus either the decrease in value or the 0 could be used as a flag to indicate the end of the
legend. Both the MAP TITLE and individual legend records may be continued, with a marker in
columns 77-80 denoting the continuation. The map number, title, and legend are made up of
EBCDIC records of length 80. The next records are the rows of the map. Each of these records
consists of integer values, each right justified in a field of length four: 20 values in each of the
first twelve records of a row and 18 in the thirteenth. Thus each map row is composed of 13 records,
with the entire map, exclusive of the header information, requiring 2002 records. Like the earlier
records, the map rows are EBCDIC. As stated above, the first and last rows of each map are
completely outside the study area, so, for example, the first 13 map records contain 258 zero values.

Table 2

Setup of Records Within a Map File

Record 1, Map Identification Number

    Column 1
    MAP NO. AAA FROM PAGE [or SHEET] III OF THE NATIONAL ATLAS, 1970 EDITION

    AAA is a right justified alphanumeric field.
    III is a right justified integer field.

Record 2+, Title Record

    Column 1
    MAP TITLE: followed by title and information applicable to entire map.

    Note: the Title Record may be continued on subsequent cards; continuation will be
    indicated by asterisks in columns 77-80.

Records 3 to 3 + NC, Legend Records

    A legend record will occur for each of the codes (NC in number) in each map. For these
    records, the numeric code associated with each map symbol is right justified in the first
    four columns. This code is followed by a blank and the meaning of the code. For maps
    in which the code represents an interval of values, the information following the numeric
    code is as follows:

    Column 6
    DDDD1 TO DDDD2

    where DDDD1 and DDDD2 are real or integer values of the end points.

    Note: a Legend Record may be continued on subsequent cards; continuations will be
    indicated by asterisks in columns 77-80.

    Note: 0 is the code for outside study area in all maps and is the last numeric code in
    the legend.

Records 3 + NC + 1 to 3 + NC + 2002, Rows of Map

    Thirteen records per row.
    First twelve records, 20 fields of I4.
    Thirteenth record, 18 fields of I4.
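
The row layout described above (13 card images per map row, four-column integer fields) can be read with a short routine such as the following sketch. It assumes the 2002 map-row records have already been separated from the header records and decoded from EBCDIC to text (Python's `cp037` codec is one possibility, though which EBCDIC code page applies is not stated in the text). The function and constant names are hypothetical.

```python
import numpy as np

N_ROWS, N_COLS = 154, 258      # grid dimensions given in the text
RECORDS_PER_ROW = 13           # twelve records of 20 fields plus one of 18
FIELD_WIDTH = 4                # each value is right justified in four columns

def parse_map_rows(records):
    """Convert the 2002 map-row card images into a 154 x 258 integer array.

    records is a sequence of text strings, one per card image, already
    decoded from EBCDIC. Every field is assumed to hold a number.
    """
    if len(records) != N_ROWS * RECORDS_PER_ROW:
        raise ValueError("expected %d map records" % (N_ROWS * RECORDS_PER_ROW))
    grid = np.zeros((N_ROWS, N_COLS), dtype=int)
    for row in range(N_ROWS):
        cards = records[row * RECORDS_PER_ROW:(row + 1) * RECORDS_PER_ROW]
        values = []
        for n, card in enumerate(cards):
            n_fields = 20 if n < RECORDS_PER_ROW - 1 else 18
            card = card.ljust(80)  # pad short cards to full card width
            for i in range(n_fields):
                field = card[i * FIELD_WIDTH:(i + 1) * FIELD_WIDTH]
                values.append(int(field))
        grid[row] = values
    return grid
```

Header parsing is left out because the number of title and legend records varies from map to map.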

Three examples of the maps from the data base are given in Figure 1. 

USE OF THE DATA BASE 

While at this time we have not concluded analysis of this data base, we outline below, in
general terms, our analysis strategy (see Figure 2). A further treatment of this topic will be given in a
future report. 

The first step in the analysis consists of variable reduction. For each location in the grid there 
are 85 values, which are far too many to be considered together when selecting samples or for
constructing a general linear model. The variable reduction can be accomplished in two steps.

Figure 1(a). Examples from National Atlas Data Base - Geology (Sheets 74-75)

Figure 1(b). Examples from National Atlas Data Base - Potential Natural Vegetation (Sheets 90-91)

Figure 1(c). Examples from National Atlas Data Base - Topographic Relief (Sheet 59)

Figure 2. Generalized Strategy for Analysis of National Atlas Data Base

Figure 3. Example of Conceptual Crossing of Factors from National Atlas Data Base

Conceptually, we can portray the cross tabulation of the factors in the data base as a
hyperdimensional data cube, an example of which for three dimensions is given in Figure 3. The cells of
the cube contain the number of grid points possessing the combination of factor levels represented
by the cell. The first step of variable reduction is qualitative, in that the number of values that any
map possesses can be reduced by examining the data cube and map legends and deleting those levels
and combinations of levels which are unimportant or trivial for the objectives of the research being
conducted. A natural extension of this step is the elimination of entire maps. However, most of
the maps should pass through this initial sieve unaltered. In the next step a variety of quantitative
methods will be used to construct subsets of the 85 maps such that members of a subset have simi- 
lar patterns of variation or are closely correlated with one another. These methods are included 
under: 

1. ANOVA related techniques (Greig-Smith, 1964) used to assign variation according to scale, 
and 

2. measures which result in the construction of similarity matrices (Everitt, 1974), which are
inputs for non-parametric methods or procedures for the analysis of multilevel data sets, 
such as multidimensional scaling (Kruskal, 1964). 
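
As a concrete illustration of the cross tabulation that underlies the variable reduction, the sketch below builds a small data cube whose cells count the grid points sharing each combination of factor levels. The function name and sample maps are hypothetical, and excluding cells coded 0 (outside the study area) in any map is a choice of this sketch.

```python
import numpy as np

def cross_tabulate(maps, outside_code=0):
    """Count grid points for every combination of levels across several maps.

    maps is a list of equally shaped integer arrays. Cells equal to
    outside_code in any map are left out. Returns the sorted levels found
    in each map and the count array (one axis per map).
    """
    flat = [np.asarray(m).ravel() for m in maps]
    inside = np.ones(flat[0].shape, dtype=bool)
    for codes in flat:
        inside &= codes != outside_code
    levels, indices = [], []
    for codes in flat:
        found, index = np.unique(codes[inside], return_inverse=True)
        levels.append(found)
        indices.append(index)
    cube = np.zeros([len(found) for found in levels], dtype=int)
    np.add.at(cube, tuple(indices), 1)  # accumulate one count per grid point
    return levels, cube

# Two hypothetical 2 x 3 maps; the top-left cell is outside the study area.
map_a = np.array([[0, 1, 1],
                  [2, 2, 1]])
map_b = np.array([[0, 5, 6],
                  [5, 5, 6]])
levels, cube = cross_tabulate([map_a, map_b])
```

A dense array is workable only for a handful of maps at a time; crossing all 85 maps at once would need a sparse representation, which is one practical reason for reducing the variables first.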

Once the subsets of maps are constructed, maps representative of each subset can be randomly
selected. We will cluster this reduced data set, incorporating information from a series of
contiguity or proximity measures (Cliff and Ord, 1973) into clustering algorithms. The outcome of
these analyses will be a hierarchical series of clusters representing “homogeneous regions” of the
U.S.A. From among these “homogeneous regions,” the proportion of the population covered by
each region may be calculated and an experimental design may be constructed. The results could be 
used in turn to make an estimate of the total number of samples needed and their spatial distribution. 

SUMMARY AND CONCLUSIONS 

We have developed a data base composed of 85 maps from the U.S.G.S. National Atlas. The
data base is motivated by the need for a rationale to select representative areas of the U.S.A. for 
examination by remote sensing. The maps included in the data base are those which might reason- 
ably be believed to influence remote sensing and display properties of the land surface and climatic 
conditions over the conterminous U.S.A. The maps were point counted or transformed so that an 
observation was made every 17.6 km (11 mi.) across the U.S.A. This resulted in a grid of 154 rows
by 258 columns for each map yielding 39,732 points of which 22,505 are within the study area. 

The maps were registered to one another and then iteratively error checked using manual and automated
approaches. The “error-free” maps were placed on magnetic tape one map per file, each map
preceded in the file by identification information and a legend of map symbols. In future work we will
use the data base to create a map of “homogeneous” regions. Such a map will represent a
stratification of the population. The validity of this stratification will be examined by randomly sampling
scenes from the strata, sampling pixels from the scenes, and then testing the significance of the 
grouping factors (which formed the stratification) using techniques from the field of general linear 
models, particularly analyses of variance. 

We have stressed the utility of the data base for the selection of areas the size of scenes or 
greater. A second, and perhaps a third stage, of sampling is needed to complete the data collection. 
We have not yet investigated strategies for these additional stages. This recognition of the presence
of multiple sampling scales is related to the recognition that the grouping factors which will define 
the homogeneous regions will be unlikely to account for all the variation in the remotely sensed 
data. However, the amount of variation explained by these factors is still unknown and will be 
important in designing smart or onboard satellite processing as well as in analyzing previously col- 
lected data. Such work will be important for questions pertaining to optimizing temporal and spec- 
tral resolutions. Finally, although we have considered the data base solely within a remote sensing 
context, it and the philosophy behind it clearly have utility for other studies where the researcher 
desires to make inferences about large geographic regions. 